Exploring Places

places_classes.png

Experiments:

    1. Visualizing Places dataset
    2. Exploring Tags Places
    3. Exploring Towns & Places Names
    4. Exploring Properties
    5. Exploring Descriptions Places Similarities
    6. Descriptions Places Topic Modelling

1. Importing libraries and loading the places JSON file into a dataframe

In [2]:
import json
import pandas as pd
import plotly.express as px
import os
import plotly.graph_objects as go
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic
In [9]:
with open('places.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)

2. Visualizing the places dataframe

In [11]:
df.shape[0]
Out[11]:
100

Experiment 1: Exploring Place Ids

In [12]:
df_ids=df.groupby(['place_id']).size().reset_index()
df_ids=df_ids.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
df_ids
Out[12]:
place_id number_of_times
0 00000322-2842-bb9a-e74e-0dd400007687 1
63 63c0eadd-4f1e-b60e-16e0-bcc400005c32 1
73 8cb7eadd-4f1e-b60e-93ea-5fc40131612c 1
72 8c04eadd-4f1e-b60e-f8e0-bcc401315121 1
71 8b20eadd-4f1e-b60e-65e0-bcc400003c3c 1
... ... ...
30 0000b307-0f42-bb9a-5f60-7df40000afc5 1
29 0000b182-f942-bb9a-33be-37050000c1d5 1
28 0000a786-3b42-bb9a-0825-c6050000b7eb 1
27 0000a574-7942-bb9a-77c7-d0f4000083ff 1
99 ffafdadd-4f1e-b60e-63e0-bcc4000030d4 1

100 rows × 2 columns
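
Every count above is 1, so each place_id appears exactly once. The same check can be written more directly. A minimal sketch on a toy frame (the three IDs are made up for illustration), using `Series.is_unique` and `Series.value_counts`:

```python
import pandas as pd

# Toy stand-in for the places dataframe; these IDs are illustrative only.
toy = pd.DataFrame({"place_id": ["a-1", "b-2", "c-3"]})

# True when no place_id occurs more than once.
ids_unique = toy["place_id"].is_unique

# Same information as groupby(...).size(), already sorted by count (descending).
id_counts = (
    toy["place_id"]
    .value_counts()
    .rename_axis("place_id")
    .reset_index(name="number_of_times")
)
print(ids_unique)
print(id_counts)
```

On the real dataframe, `df["place_id"].is_unique` would answer the question in one line without building the full count table.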

Experiment 2: Exploring Tags Places

We are going to separate the elements stored in each tag list into new rows.

In [13]:
df["tags"][0:5]
Out[13]:
0                   [parks]
1                        []
2    [town & village halls]
3                 [schools]
4    [town & village halls]
Name: tags, dtype: object
In [14]:
df_tags=df.explode('tags')
In [15]:
df_tags
Out[15]:
place_id list_id created_ts modified_ts name sort_name address town postal_code country_code lat lng tags descriptions properties website links images
0 1a8fdadd-4f1e-b60e-13e0-bcc4000028af 10415 2000-02-05T00:00:00 2000-02-05T00:00:00 Princess Royal & Duke of Fife Memorial Park Princess Royal & Duke of Fife Memorial Park Broombank Terrace Braemar AB35 5YX GB 57.00520 -3.40440 parks [] {} NaN NaN NaN
1 1ed0eadd-4f1e-b60e-36e0-bcc400002980 10624 2000-03-03T00:00:00 2000-03-03T00:00:00 City Centre City Centre (Glasgow) NaN Glasgow G1 GB 55.88330 -4.25000 NaN [] {'phone.info': '0141 204 4400'} NaN NaN NaN
2 7161eadd-4f1e-b60e-d6e0-bcc400002917 10519 2000-04-06T23:00:00 2000-04-06T23:00:00 Charlotte Toal Centre Charlotte Toal Centre Dundyvan Road Coatbridge ML5 1DB GB 55.85730 -4.03413 town & village halls [] {'phone.info': '01698 267515'} NaN NaN NaN
3 f160eadd-4f1e-b60e-a5e0-bcc40000275b 10075 2000-06-21T23:00:00 2000-06-21T23:00:00 Linlithgow Primary School Linlithgow Primary School Preston Road Linlithgow EH49 6EH GB 55.97162 -3.61258 schools [] {} NaN NaN NaN
4 8fbfdadd-4f1e-b60e-83e0-bcc400002c5c 11356 2000-07-09T23:00:00 2000-07-09T23:00:00 Tait Hall Tait Hall Edenside Road Kelso TD5 7BS GB 55.60140 -2.43025 town & village halls [] {'phone.info': '01573 224233'} NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 0000b182-f942-bb9a-33be-37050000c1d5 49621 2012-10-09T09:15:31 2012-10-09T12:32:24 Newick Village Newick Village Lewes GB 50.97525 0.01480 village [] {} http://www.newick.net/ NaN NaN
97 c5bfdadd-4f1e-b60e-73e0-bcc400002920 10528 2002-05-05T23:00:00 2012-10-10T11:28:01 The Cinema Cinema: Newton Stewart 33--35 Victoria Street Newton Stewart DG8 6NL GB 54.95880 -4.48262 cinemas [{'type': 'default', 'description': 'Run on a ... {'phone.info': '01671 403373'} http://www.nscinema.co.uk NaN NaN
98 558fdadd-4f1e-b60e-03e0-bcc4000030bf 12479 2001-10-25T23:00:00 2012-10-10T11:36:09 The Pavilion Pavilion: Galashiels Market Street Galashiels TD1 3AF GB 55.61570 -2.80571 cinemas [{'type': 'default', 'description': 'First ope... {} http://www.pavilioncinema.co.uk NaN NaN
99 0000d282-0a42-bb9a-e594-47050000c1ee 49646 2012-10-09T15:57:18 2012-10-10T11:37:19 St Paul's School St Paul's School Lonsdale Road London SW13 9JT GB 51.48768 -0.23991 public buildings [] {'phone.info': '0208 748 9162'} http://www.stpaulsschool.org.uk/ NaN NaN
99 0000d282-0a42-bb9a-e594-47050000c1ee 49646 2012-10-09T15:57:18 2012-10-10T11:37:19 St Paul's School St Paul's School Lonsdale Road London SW13 9JT GB 51.48768 -0.23991 school [] {'phone.info': '0208 748 9162'} http://www.stpaulsschool.org.uk/ NaN NaN

159 rows × 18 columns
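
The jump from 100 to 159 rows, the NaN tag on row 1, and the repeated index 99 all follow from how `DataFrame.explode` treats lists. A small sketch on toy rows (values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy rows: one tag, no tags, two tags.
toy = pd.DataFrame({
    "name": ["Park", "City Centre", "School"],
    "tags": [["parks"], [], ["public buildings", "school"]],
})

exploded = toy.explode("tags")

# An empty list becomes a single row with NaN; a two-tag row is repeated
# once per tag and keeps its original index label.
print(exploded)

# Rows without any tag can be dropped before counting.
tagged = exploded.dropna(subset=["tags"])
print(len(tagged))
```

Because the original index labels are kept, `exploded.index` can still be used to trace each tag row back to its place.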

In [16]:
g_tags=df_tags.groupby(['tags']).size().reset_index()
g_tags=g_tags.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
g_tags
Out[16]:
tags number_of_times
5 cinemas 23
31 public buildings 15
49 town & village halls 10
28 outdoors 8
32 pubs & bars 6
48 tourist attractions 5
29 parks 4
20 hi-arts 3
50 university 3
51 venues 3
46 temporary 3
39 sporting venues 3
8 community centre 3
7 clubs 3
34 restaurants 3
2 church 3
1 bar & pub food 2
40 sports & leisure centres 2
44 student friendly 2
54 visitor centre 2
47 theatres 2
15 festivals 2
14 festival 2
13 eusa 2
10 country parks 2
9 community centres 2
4 churches 2
45 swimming 1
36 schools 1
43 stately homes 1
42 stadiums 1
41 stadium 1
52 village 1
38 shops 1
53 village hall 1
37 shopping centre 1
0 accommodation 1
35 school 1
33 real ale 1
3 church hall 1
6 club 1
11 deuchars 1
12 education 1
16 forests 1
17 galleries 1
18 golf course 1
19 golf courses 1
21 historic site 1
22 hotels 1
23 library 1
24 lochs 1
25 music venue 1
26 nightclubs 1
30 promenades 1
27 outdoor 1
In [17]:
px.histogram(g_tags, x="tags", y="number_of_times", histfunc="sum", color="tags", title='Frequency of tags places')
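
The counts above split some tags across singular and plural spellings (for example `church` and `churches`, or `festival` and `festivals`). If such variants should be counted together, one possible approach is an explicit alias table applied before grouping. The sketch below is an assumption, not part of the original notebook: `TAG_ALIASES` and `normalise_tags` are hypothetical names, and which variants really mean the same thing is a judgment call.

```python
import pandas as pd

# Hypothetical alias table: variant spelling -> canonical tag.
TAG_ALIASES = {
    "church": "churches",
    "festival": "festivals",
    "school": "schools",
    "stadium": "stadiums",
}

def normalise_tags(tags: pd.Series) -> pd.Series:
    """Replace known variants with their canonical tag; leave others untouched."""
    return tags.map(lambda t: TAG_ALIASES.get(t, t))

toy_tags = pd.Series(
    ["church", "churches", "cinemas", "festival", "festivals"], name="tags"
)
merged_counts = normalise_tags(toy_tags).value_counts()
print(merged_counts)
```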

Experiment 3: Exploring Towns & Names

In [18]:
df["town"][1:10]
Out[18]:
1       Glasgow
2    Coatbridge
3    Linlithgow
4         Kelso
5       Carluke
6     Lockerbie
7    Strathaven
8          Tain
9    Galashiels
Name: town, dtype: object

3.1 Frequency of places grouped by towns

In [19]:
df_town=df.dropna(subset=['town'])
town=df_town.groupby(['town']).size().reset_index()
town=town.rename(columns={0: "number_of_times"})
town=town.drop([0])
In [20]:
town=town.sort_values(by=['number_of_times'], ascending=False)
town
Out[20]:
town number_of_times
24 Glasgow 6
19 Edinburgh 5
31 Inverness 4
57 Paisley 2
23 Galashiels 2
... ... ...
36 Kelso 1
2 Banchory 1
39 Knutsford 1
40 Leven 1
73 Winchester 1

73 rows × 2 columns

In [21]:
px.scatter(town, x='town', y='number_of_times', color='number_of_times',  size="number_of_times", size_max=60, title="Frequency of places grouped by towns")

3.2 Frequency of places grouped by name

In [22]:
df_name_town=df.groupby(['name']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town.reset_index()
Out[22]:
index name number_of_times
0 89 Town Hall 2
1 0 Angus Archives 1
2 62 Portobello Promenade 1
3 72 Showcase Cardiff Nantgarw 1
4 71 Sherdley Park 1
... ... ... ...
94 30 Kemnay Church Centre 1
95 29 Inverness Ice Centre 1
96 28 Innerleithen 1
97 27 Ice Factory 1
98 98 West Church Hall 1

99 rows × 3 columns

3.3 Frequency of places grouped by name and town

In [23]:
df_name_town=df.groupby(['name', 'town']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town
Out[23]:
name town number_of_times
0 Angus Archives Forfar 1
73 Silverburn Glasgow 1
71 Showcase Cinema Glasgow Glasgow 1
70 Showcase Cardiff Nantgarw Nr Pontypridd 1
69 Sherdley Park St Helens 1
... ... ... ...
30 Knockburn Loch Banchory 1
29 Kemnay Village Hall Kemnay 1
28 Kemnay Church Centre Inverurie 1
27 Inverness Ice Centre Inverness 1
97 West Church Hall Dundee 1

98 rows × 3 columns
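
"Town Hall" is counted twice in 3.2, yet no (name, town) pair repeats in 3.3, which suggests the two places share a name but not a town. The lower row count in 3.3 is consistent with `groupby` leaving out rows whose town is missing, which is its default behaviour. A sketch of both points on toy data (the towns assigned to each name here are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Town Hall", "Town Hall", "Tait Hall", "Village Green"],
    "town": ["Kelso", "Paisley", "Kelso", None],
})

# keep=False marks every row whose name occurs more than once.
shared_names = toy[toy.duplicated(subset=["name"], keep=False)]
print(shared_names)

# groupby drops rows with a missing key unless dropna=False is passed.
towns_default = toy.groupby("town").size()
towns_with_missing = toy.groupby("town", dropna=False).size()
print(len(towns_default), len(towns_with_missing))
```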

Experiment 4: Exploring Properties

In [24]:
df_properties=pd.concat([df.drop(['properties'], axis=1), df['properties'].apply(pd.Series)], axis=1)
In [25]:
df_properties[0:3]
Out[25]:
place_id list_id created_ts modified_ts name sort_name address town postal_code country_code ... website links images phone.info place.times.seasonal place.facilities.parking place.facilities.toilets place.facilities.free-wifi place.facilities.toilets_disabled place.facilities.wheelchair-access
0 1a8fdadd-4f1e-b60e-13e0-bcc4000028af 10415 2000-02-05T00:00:00 2000-02-05T00:00:00 Princess Royal & Duke of Fife Memorial Park Princess Royal & Duke of Fife Memorial Park Broombank Terrace Braemar AB35 5YX GB ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1ed0eadd-4f1e-b60e-36e0-bcc400002980 10624 2000-03-03T00:00:00 2000-03-03T00:00:00 City Centre City Centre (Glasgow) NaN Glasgow G1 GB ... NaN NaN NaN 0141 204 4400 NaN NaN NaN NaN NaN NaN
2 7161eadd-4f1e-b60e-d6e0-bcc400002917 10519 2000-04-06T23:00:00 2000-04-06T23:00:00 Charlotte Toal Centre Charlotte Toal Centre Dundyvan Road Coatbridge ML5 1DB GB ... NaN NaN NaN 01698 267515 NaN NaN NaN NaN NaN NaN

3 rows × 24 columns
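
`apply(pd.Series)` builds one Series per row, which can become slow on larger frames. A common alternative is to build the property columns in one pass from the list of dicts. A minimal sketch on toy data (the phone numbers are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Park", "Centre", "Hall"],
    "properties": [
        {},
        {"phone.info": "0000 000 0000"},
        {"phone.info": "0000 000 0001", "place.facilities.parking": True},
    ],
})

# Build the property columns in one pass from the list of dicts;
# reusing toy.index keeps the rows aligned for the join.
props = pd.DataFrame(toy["properties"].tolist(), index=toy.index)
expanded = toy.drop(columns="properties").join(props)
print(expanded)
```

Places with an empty properties dict simply get NaN in every property column, as in the output above.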

4.1 Frequency of places grouped by wheelchair-access and town

In [26]:
df_properties_wc=df_properties.groupby(['place.facilities.wheelchair-access', 'town']).size().reset_index()
df_properties_wc=df_properties_wc.rename(columns={0: "number_of_times"})
df_properties_wc=df_properties_wc.sort_values(by=['number_of_times'], ascending=False)
df_properties_wc
Out[26]:
place.facilities.wheelchair-access town number_of_times
0 True Milngavie 1

4.2 Frequency of places grouped by toilets_disabled and town
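
No code or output survives for this subsection. A minimal sketch, assuming it mirrors 4.1 with the `place.facilities.toilets_disabled` column; the helper name `count_by_property_and_town` and the toy rows are illustrative and not from the original notebook:

```python
import pandas as pd

def count_by_property_and_town(frame: pd.DataFrame, prop: str) -> pd.DataFrame:
    """Count places per (property value, town), most frequent first."""
    return (
        frame.groupby([prop, "town"])
        .size()
        .reset_index(name="number_of_times")
        .sort_values(by="number_of_times", ascending=False)
    )

# Toy rows: three places with the property set, one without.
toy = pd.DataFrame({
    "town": ["Glasgow", "Glasgow", "Kelso", "Braemar"],
    "place.facilities.toilets_disabled": [True, True, True, None],
})

toilets = count_by_property_and_town(toy, "place.facilities.toilets_disabled")
print(toilets)
```

Rows where the property is missing drop out of the grouping, which is also why 4.1 shows only the single place that has a wheelchair-access value.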

Experiment 5: Exploring Descriptions

In [31]:
df_descriptions=df.explode('descriptions')
df_descriptions=pd.concat([df_descriptions.drop(['descriptions'], axis=1), df_descriptions['descriptions'].apply(pd.Series)], axis=1)
df_descriptions=df_descriptions.dropna(subset=['description']).reset_index()
documents=df_descriptions["description"].values
In [32]:
len(documents)
Out[32]:
13
In [36]:
import re 
from gensim.parsing.preprocessing import remove_stopwords
def clean_documents(text):
    text = re.sub(r'\S*@\S*\s?', '', text, flags=re.MULTILINE) # remove email
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE) # remove web addresses
    text = re.sub("\'", "", text) # remove single quotes
    text = remove_stopwords(text)
    return text
In [37]:
d=[]
for text in documents:
    d.append(clean_documents(text))
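
To see what each cleaning step removes, here is a sketch of the same pipeline applied to one made-up sentence. It swaps gensim's `remove_stopwords` for a tiny hand-written stopword set (`TOY_STOPWORDS`, an assumption for illustration) so that it runs without gensim; the function name `clean_text_sketch` is likewise hypothetical.

```python
import re

# Tiny stand-in for a real stopword list (assumption, for illustration only).
TOY_STOPWORDS = {"the", "is", "at", "or", "us"}

def clean_text_sketch(text: str) -> str:
    text = re.sub(r"\S*@\S*\s?", "", text)  # drop tokens containing '@'
    text = re.sub(r"http\S+", "", text)     # drop URLs starting with http
    text = text.replace("'", "")            # drop single quotes
    # Keep words that are not in the stopword set (case-sensitive match).
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

sample = "Email us at info@example.org or visit http://example.org the town's cinema"
cleaned = clean_text_sketch(sample)
print(cleaned)  # Email visit towns cinema
```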

Generating Text Embeddings

In [38]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the descriptions with the pretrained all-MiniLM-L6-v2 model (inference only, no training).
# Note: the raw `documents` are encoded here, not the cleaned list `d` built above.
text_embeddings = model.encode(documents, batch_size = 8, show_progress_bar = True)

In [39]:
np.shape(text_embeddings)
Out[39]:
(13, 384)

Description Similarity

In [40]:
similarities = cosine_similarity(text_embeddings)
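# argsort is ascending, so the last entry of each row is normally the document itself
# (score 1.0) and the second-to-last is its most similar other document.
similarities_sorted = similarities.argsort()
id_1 = []
id_2 = []
score = []
for index,array in enumerate(similarities_sorted):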
    id_1.append(index)
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])
index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
print(index_df)
    id_1  id_2     score
0      0     5  0.301427
1      1     6  0.442145
2      2     9  0.519278
3      3     7  0.574173
4      4     2  0.509976
5      5     7  0.679617
6      6     1  0.442145
7      7     5  0.679617
8      8     7  0.653112
9      9     7  0.536088
10    10     2  0.478014
11    11     3  0.372194
12    12     3  0.415382
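
Taking `array[-2]` relies on each document's self-similarity sorting last. An alternative that does not depend on that ordering is to mask the diagonal and take the row-wise maximum. The sketch below is an assumption rather than the notebook's method: it uses plain NumPy, toy 2-D vectors in place of the real embeddings, and a hypothetical helper name `nearest_neighbours`.

```python
import numpy as np

def nearest_neighbours(embeddings: np.ndarray):
    """Return (index of most similar other row, cosine score) for each row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    masked = sims.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude self-matches explicitly
    best = masked.argmax(axis=1)
    return best, sims[np.arange(len(sims)), best]

# Toy 2-D vectors standing in for sentence embeddings.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
best, scores = nearest_neighbours(toy)
print(best, scores)
```

As in the table above, mutual nearest neighbours show up twice with the same score, once from each side of the pair.
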
In [41]:
index_df["score"].sort_values(ascending=False)
Out[41]:
5     0.679617
7     0.679617
8     0.653112
3     0.574173
9     0.536088
2     0.519278
4     0.509976
10    0.478014
1     0.442145
6     0.442145
12    0.415382
11    0.372194
0     0.301427
Name: score, dtype: float32
In [44]:
index_df.iloc[5]
Out[44]:
id_1     5.000000
id_2     7.000000
score    0.679617
Name: 5, dtype: float64

NOTE: Documents 5 and 7 seem to be the most similar. Let's see what they contain.

In [46]:
documents[5]
Out[46]:
'As part of the McArthur Glen Designer Outlet, this eight screen multiplex is surrounded by restaurants, shops and plentiful parking.'
In [47]:
documents[7]
Out[47]:
'Part of the Barrbridge Leisure Centre, this 14 screen multiplex is well served by local eateries and plentiful parking.'